CUDA: add implicit conv3d #16948
base: master
Conversation
… 3x slower than im2col

Quick and dirty perf test run:

Need to get tensor cores working :)
With that last commit I am catching an illegal memory access.

Give it another try. The tensor core path should be on and working.
Looks good!

Please add the perf tests to make_test_cases_perf in test-backend-ops.cpp.
Will do. Right now, @Green-Sky, does the WAN video generation work OK?
As far as I can tell, yes. I have only tested Wan2.2-TI2V-5B so far (wan2.2 VAE). Single "frame" (i.e. image gen using WAN): I don't have the im2col GEMM image ready, but it looked identical. Same for 5 frames. I was unable to run higher frame counts due to my 8 GiB VRAM limit. It should technically just fit if I restart and don't open extra programs. Not gonna do that today.
I tried a smaller resolution and I did get another illegal memory access:
@Green-Sky, can you retry my latest commit? I see your test failed at the CONCAT op. That's weird. Could you also run this case with the current main branch? Thanks.
Same issue on the last commit and on latest ggml. Keep in mind that sd.cpp uses ggml and not llama.cpp, so I have to change the patch manually and apply it onto ggml. :)
So it is a ggml issue?
I cannot say that definitively; it might be sd.cpp too. But AFAIK the CONCAT is operating on the outputs of conv3d operations. Maybe we are missing a sync somewhere.
@leejet I suspect sd.cpp is doing something wrong here, because if width and height are multiples of 32, decoding works. But if they are only multiples of 16, I get that illegal memory access error. It is still possible that it is an error in this PR.
wan2.2-TI2V-5B-Q8_0-cat-strolling_funkysize.mp4 |
@Green-Sky, I just fixed a bug which may be related to your crash. Please give it a try. Thanks.
Yes, I can indeed now use multiples of 16 that are not multiples of 32. 🚀 wan2.2-TI2V-5B-Q8_0-cat-strolling_funkysize_16.mp4
I switched to the fastwan 2.2 TI2V 5B model and, using 6 steps, generated the most frames and resolution I could fit: 576x352 at 93 frames (fastwan2.2-t2i5b_cat1-steps6.mp4) in 56 seconds total. On CPU this would probably take over an hour.
@Green-Sky, can you compare the performance of IM2COL_3D vs implicit GEMM? Thanks.
With sd.cpp that is rather impractical for me; I don't have 20 GiB of VRAM 😄. I can only run a limited selection of the tests that you provide.
Sure. You have done a lot of great work testing these PRs in sd.cpp. Many thanks.
I will redo this one later (#16948 (comment)), but as you can see it did not complete. :)
I was experimenting more with WAN and got another of those illegal access errors in CONCAT. But I am not on the latest commit, so I will check latest.
edit: It has indeed been fixed somewhere in the last few commits. :)
You changed the test selection again :P Anyway, here is that one test:
edit: Can you keep them enabled and not commit your last selection? Also, it would be really cool if they were sorted a bit by the VRAM use of im2col+GEMM. That would really help bench this on more hardware.
Sorry, I am just adding more test cases (obtained from sd.cpp WAN generation). I want to test these real-world cases instead of made-up ones. But in my commit I am going to turn most of them off because CI will time out otherwise. You can always easily turn them on by uncommenting them.

This PR adds an implicit conv3d op to the CUDA backend, as a complement to the IM2COL_3D+GEMM path currently used by sd.cpp for video models. It pretty much follows conv2d_implicit.
Using tensor cores, this PR outperforms IM2COL_3D+GEMM. The memory savings are significant, as expected.